Part I - Prosper Loans Dataset Exploration

by Amicah Maina

Introduction

Preliminary Wrangling

Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.

What is the structure of your dataset?

Your answer here!

  • The dataset contains 113937 rows of data with 81 columns. (ListingKey,ListingNumber,ListingCreationDate,CreditGrade,Term,LoanStatus,ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors)

What is/are the main feature(s) of interest in your dataset?

The BorrowerRate or interest rate of the loans. My focus of investigation is to determine what features are best for predicting Loan BorrowerRate in the dataset?

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

The following features will help support my investigation

Univariate Exploration

Investigating BorrowerState Variable

We can see from the plot that California has the highest number of prosper loan borrowers followed by three other States. I will have to sort the BorrowerState programmatically to show a clear order of the states in order.

The pandas series function value counts is valuable here. This function counts the frequency of each unique value in the series and then sort them in descending order. In order to get the bar order, we want the index values and then, store them in a variable. We pass this to the order parameter in order to get the sorted bar chart as shown below:

With this we can see the distribution clearly, California having the highest count and then Texas and then Newyork and Florida having an even count. The three later states have almost half the count of California.

As a side note it is good to note that. Prosper Marketplace, Inc. is a San Francisco, California-based company in the peer-to-peer lending industry.

Investigating the CreditGrade variable distribution.

From the data dictionary we can see that the CreditGrade is ordinal data of rating assigned to the loan listing when the listing went live. The order being as follows in ascending order:

0 - NC, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA.

This means AA is the highest rating to be given to a listing. Let's list the listing in a descending order from the highest to the lowest.

The distribution is Unimodal in nature, and We can see CreditGrade 'C' has the highest count followed by 'D' and then 'B'. Further investigation will need to be made on the relationship between this variable and BorrowerRate in bivariate investigation. To see what CreditGrade has the highest BorrowerRates.

Investigating Listing Creation Date Distribution

Investigating ListingCategory variable

From the data dictionary we can note that ListingCategory is a categorical variable. The category of the listing that the borrower selected when posting their listing:

Lets investigate this variable further to see which category has the highest count.

Let's try displaying this in a more descriptive way. Using the original text to get a more clear picture.

The debt consolidation category is ranked highest followed by not-available. It would be worth investigating the LP_InterestandFees and CreditGrades variables associated with this category in multivariate exploration.

Investigating LoanStatus Variable

In order to know what proportion of the LoanStatus falls into each category I will use relative frequency. One way of doing this is to use Matplotlib to change the tick marks from leveling absolute counts, to relative counts. To do this, I'll create a list of tick positions which I'll do by computing the length of the longest bar in terms of proportion. However, the LoanStatus column contains only categorical nominal variables and as a result has to be represented in a numerical form in order to compute the length of the longest bar for the numerator while the original df data frame gives us the denominator, we use value counts to get the numerator. To achieve this I’ll be using Pandas get_dummies function

Since I would like to see the whole distribution it is therefore important to know what the value of the least proportion is in order to accommodate it, if possible, in our numpy arange function.

Since the min_propotion is between 0 and 0.1. I will not be including it in the xticks of the figure, it will be part of my np.arange function, however.

The resulting plot shows us that more than 40% of all loanstatus is current while about 35% of loanstatus is completed. There are also about 10% defaulted loanstatus. The rest taking up the 15% left.

Investigating Investors variable distribution

The distribution of Investors values appears right skewed, with relatively few points above 400 in value but there isn't a lot of detail beyond the 400 value.

By default, matplotlib will split the data into 10 bins, which is usually too few but in this case too small because we would need to zoom in. We can use the matplotlib function xlim to set the upper and lower bounds of the bins that will be displayed in the plot. This function takes an argument, a list, or tuple of two values specifying these limits. Here, I'll make use of Numpy's arange function to create bins of size thirty. The first argument is the minimum value, while the second argument, will be the maximum value. The third argument gives the step-size for the bins.

I will add plus thirty to the second argument. This is because the values generated by arange will not include the maximum value. By adding plus thirty, this means that all values should be represented in the chart. When I add the bins list to the Hist Function, this creates a view of the data that implies the data is right skewed, save for the presence of possible outliers beyond the 600 value range.

As can be seen from the plots above, the right skewed histograms indicate most loans have a small number of investors. We will check how the number of investors affect the BorrowerRate of the loans.

Investigating MonthlyLoanPayment Variable Distribution

The MonthlyLoanPayment appear partly right skewed with a lot of MonthlyLoanPayment in the lower end. This might be due to its correlation with the Loan Amount. We will observe this at the Bivariate Exploration.

Investigating BorrowerAPR Variable Distribution

BorrowerAPR is the Borrower's Annual Percentage Rate (APR) for a loan, this is the annual interest rate for a loan. The plot above shows a slightly right skewed graph indicating most loans have less than 0.3% of Borrower APR.

Investigating BorrowerRate Variable Distribution

The Borrower Rate is the interest rate of a loan. Same case as the Borrower APR can be observed. Most of the borrower interest rates can be observed on the lower percentages of the Borrower Rate metric.

Investigating ProsperRatingNumeric Variable Distribution.

As indicated on the dictionary the ProsperRating Numeric is credit rating done by prosper on the borrowers. The barchart indicates more of the borrowers have a rating of 4 and less have 7. We can relate this rating with BorrowerRate and see how it affects the distribution of BorrowerRate.

Investigating ProsperScore Variable Distribution.

According to the data dictionary the The score ranges from 1-11, with 11 being the best, or lowest risk score. Applicable for loans originated after July 2009.

It looks like for the biggest bulk of loans has a risk score of 4 down to risk score 1 . We will investigate further on how this affects the borrower rate of these loans in bivariate exploration.

Investigating LoanOriginalAmount Variable

The initial plot of the LoanOriginalAmount shows some immediate points of attention. On the LoanOriginalAmount you would notice sudden spikes and some other very high figures in the far right. It's worth taking a bit of time to identify these outliers and see if they need to be filtered out of the data.

Remove Outliers

The DebtToIncomeRatio is right skewed which means most people have more income than debt with 99% being below 0.86. More in depth look into the relation between the ratio and the Loan Original Amount and the borrower rate assigned.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

I converted categorical columns

into ordered categorical type, for better analysis.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There was a need to rename columns:

This was to ensure conformity of the data.

Bivariate Exploration

With respect to our feature of interest, namely, predicting interest rate ( BorrowerRate), we have from the correlation heat map a very strong positive correlation with LenderYield, BorrowerAPR and EstimatedLoss, while we have a moderate negative correlation with CreditScoreRangeLower and CreditScoreRangeUpper.

The other features examined showed very weak correlation with BorrowerRate and as a result we will first concentrate on the aforementioned variables.

Investigating relationship between the BorrowerRate and BorrowerAPR, LenderYield, and EstimatedLoss.

According to the data dictionary the BorrowerRate is the interest rate for a loan while the BorrowerAPR is the Annual Percentage Rate or a calculated total annual interest rate of the loan.

LenderYield is equal to the interest rate on the loan less the servicing fee.

The Estimated Loss is the estimated principal loss on charge-offs.

The three definitions indicates a strong correlation between the four variables. Let us visualize them to confirm this.

The gradient of the line plotted for all three variables against the BorrowerRate indicate a positive correlation and linear relationship which means that an increase in one of these variables would also mean an increase in the BorrowerRate.

Investigating relationship between CreditScoreRanges and BorrowerAPR.

On the plot above we can see a strong negative correlation between both creditscore ranges and the BorrowerRate. This illustrates the effect of the credit score to the interest rate of a loan. More indept visualization is required to check on the distribution of the loans. A combination with the ProsperRatingNumeric can bring more insight into this.

Investigating relationship between ProsperScore and BorrowerAPR.

On the heat-map we see a strong negative correlation between the prosperscore and the borrowerrate. I have used a FacetGrid to further investigate this and show the correlation more in depth. The grid shows the BorrowerRate count depending on the ProsperScore. We can observe the trend with the lower scores, where the histograms are left-skewed and then becoming more right-skewed as the score improves. This shows that the lower the risk score the more the BorrowerRate and vice-versa

Investigating relation between DebtToIncomeRatio and BorrowerRate.

The relation shows a weak Positive correlation, which a higher DebtToIncomeRatio might result to a slightly higher BorrowerRate . We also see a slight abnomality which shows some of the loans with low DebtToIncomeRatio having a high BorrowerRate Let us further investigate these loans. We can check their relation to LoanOriginalAmount

The LoanOriginalAmount is relatively low for this sample of loans. Let us check their relation to the CreditGrade and the ProsperScore

On further investigation we can see the score and credit grade assigned to most of these loans are low. Which might mean due to the higher risk for the company, the borrower rate assigned was higher.

Investigating relationship between the BorrowerRate and the LoanOriginalAmount.

There appears to be a weak negative correlation between the LoanOriginalAmount and the BorrowerRate.

Investigating relationship between the BorrowerRate and the MonthlyLoanPayment.

We can observe a negative correlation between the monthlyloanpayment and BorrowerRate.

Investigating the relationship between selected Categorical variables and Numerical Variables

From the boxplot above it becomes even more clear that both prosperRatingNumeric and prosperRatingAlpha have similiar correlation with borrowerRate, as a result i will use just one in the next plot. it also seem to me that loanstatus is not a good measure for estimating borrowerRate, so i will also drop that. EmploymentStatus doesn't seem to have a strong correlation with borrowerrate, so i'll drop that as well.

Additionally, the BorrowerAPR and BorrowerRate relate to the categories similarly as they are linearly related.I will therefore drop the BorrowerAPR.

The plot above shows us a clearer picture into the relationship between the categorical variables with the numerical variables. We can observe the relationship between the Income Range and the Loan Original Amount where borrowers with higher income ranges take or borrow bigger loan amounts as compared to those with a low income range. The box plot boundaries indicating more loan amounts increase gradually as the income range. This shows a positive correlation between the two variables. We can see however a negative correlation to the borrower rate and lender yield which indicates borrowers with higher income ranges get lower borrower rates and in turn lower lender yields.

We can also observe negative correlation between the borrower rate and lender yield variables with Credit Grade and Prosper Rating Alpha. And positive correlation between the Credit Grade and ProsperRating Alpha variables with the loan amount indicating higher credit rating can allow borrowers to acquire higher loan amounts.

The relationship between the numerical variables and the Listing category however needs more indepth analysis to show its effect on them.

Investigating the relationship between CreditGrade and EstimatedLoss

It appears that for each EstimatedLoss Entry the CreditGrade column entry is null. This shows an issue with the data source. Therefore we will use the ProsperRatingNumeric as a replacement for CreditGrade as they both have the same grading system according to the data dictionary. Lets take a look at ProsperRatingNumeric.

Investigating the relationship between ProsperRatingNumeric with ProsperScore and EstimatedLoss

This served our goal better. We can observe that the higher the rating assigned to borrowers, the lower EstimatedLoss for the loans. Also we can see a negative correlation between the variables prosper score and estimated loss which indicates the better the risk score for a loan the less the estimated loss.

Investigating the relationship between BorrowerRate and EmploymentStatus

In the plot above we can observe the difference in interest rates per employment status. Right skewed histograms can be observed for employment statuses self-employed, employed, full-time and other. Which indicates more favorable borrower rates for these employment statuses. An indepth analysis is however required. I will combine the employment statuses with their income ranges to get a better picture into the relationship between the two variables.

Investigating the relationship between Occupation, ProsperScore and BorrowerRate

In the data we have 68 Occupations and a ProsperScore from 1-11.

Let us explore the risk factor(ProsperScore) calculated for each occupation. Check which occupation has a higher risk score. Then we can check each risk score and the count of the borrowerrate assigned to loans for each score.

This relation can help us determine which occupation has a likeliness to get a higher borrowerrate.

On the plots above we can observe relatively equal borrower rates are assigned to different occupations. This however is not a clear depiction of the relationship, and due to the weakness in correlation between the two variables, I will not be pursuing this line of investigation.

Due to the similarity in meaning between Credit Grade and Prosper Rating Numeric I have depicted both of them to show the relationship between the borrower rate and these credit rating variables. The change of the histograms on the plots above show lower ratings having borrower rates with a higher percentage, than higher ratings which have lower percentage borrower rates. This indicates the effect of credit rating on the borrower rates, the lower the rating the higher the interest rate and vice versa.

Investigating the relationship between BorrowerRate and Investors

As shown earlier in univariate exploration where investors histogram was right skewed histogram showing more borrowers had lower amounts of investors. I sought to show the effect of those investors on the borrower rate of the loans. The graph above shows a weak negative correlation between the variables. Indicating more investors in a loan could decrease slightly the borrower rate assigned to those loans.

Investigating the relationship between BorrowerRate and IncomeRange

The plots above are meant to extrpolate further the relationship between the IncomeRange and our focus variable borrower rate. We can see the histograms for borrower rates get more and more right skewed as the income range increases. The point plot shows a slope to the right from income range 1-24999 as the income range increases to 100000$. Showing clearly the effect of more income on the borrower rate of a loan, where borrowers with higher income ranges are more probable to get lower borrower rates.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Upon my investigations, the following observations were made. The following showed a strong positive correlation with BorrowerRate:

Which means an increase in any of these variables or all would mean a consecutive increase in the BorrowerRate.

The following variables show a moderate correlation with BorrowerRate

These variables show a weak negative correlation with BorrowerRate.

According to the data dictionary the following assessment variables exist; I will list them with their relation to BorrowerRate;

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Multivariate Exploration

Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

Investigating ProsperScore in relation to BorrowerRate and LoanOriginalAmount.

The line gradient of the graphs show a change in the correlation between the BorrowerRate and LoanOriginalAmount while the ProsperScore rises. A slight positive correlation can be observed on score 11 showing a higher loan amount can result in a slight increase in the BorrowerRate. While a lower score can result in a higher BorrowerRate but would decrease as the Loan amount increases.

Effect of the Risk Score(ProsperScore) on the DebtToIncomeRatio and BorrowerRate relation

In an attempt to show the effect of the prosper score to the weak positive correlation between the Borrower Rate and the DebttoIncomeRatio, we can see the risk score weaken further the relation. Which shows that a better risk score that indicates less risk for the loan can reduce the effect of the DebtToIncomeRatio therefore resulting into lower borrower rates for higher debttoincomeratio.

We can see that with a lower prosperscore, the BorrowerRate assigned to loans starts at a higher limit rising from there as the ratio increases. With a higher ProsperScore the BorrowerRate starts at a lower limit rising as the ratio increases.

Investigating the effect of the credit rating (ProsperRatingNumeric) to the correlation between LoanOriginalAmount and BorrowerRate

On the graph above we can see the concentration of the loans lower to the lower borrower rates limits, which indicates the effect of the higher credit rating for the loans. However, we can also see the weak negative correlation between the borrowerrate and loan amount weakened further as the borrower rate limit is reached. We can see that despite the higher credit rating of the borrowers the loans were not assigned lower borrower rates especially on the highest rating score indicaing a limit to the interest rates

Effect of the IncomeRange to the correlation between ProsperScore and BorrowerRate.

On this plot we can see the strengthening effect of the prosperscore and the income range on the borrower rate. We can observe deeper colored points on the lower end of each score indicating that despite the rating score the higher the income range the less the borrower rate. This effect combined with the decrease in the borrower rates as the risk score improves shows a strengthening factor by the two variables on the borrower rates.

Same can be observed for the prosper rating scale graph.

Due to the effect of the rating and risk score on other variables. I decided to combine both to show their effect on each other. We can observe that due to the strong strengthening factor by the two variables, the distribution of the borrower rates appears more predictable. Where low borrower rates were assigned to loans with a high prosper rating score and as the risk score improves also the high borrower rates depleted to show even lower borrower rate for the better risk score. This shows that a loan with a high credit rating and the best risk score will only get lower borrower rates.

Change in the correlation between IncomeRange and BorrowerRate depending on the EmploymentStatus

Previously we discussed the effect of the income range to the borrower rates. Where the higher the income range the lower the borrower rates.We also decided to investigate further its effect on the employment status relation with the borrower rate. We can therefore observe on the plot above, the better point plot slopes of the employment statuses employed, full-time and part time as the incomeranges increase. Indicating that employed borrowers received lower borrower rates and even lower as the income ranges increased.

Effect of ProsperScore and ProsperRating Numeric on BorrowerRate and EstimatedLoss correlation.

On the plot above we can see the effect of the prosperscore on the relationship between estimated loss and borrowerrate. We see that the relationship between the two variables does not weaken, however, we can also see the change in the concentration of loans as the prosper risk score improves. We can see a lower concentration of loans on the higher borrower rates and estimated loss scale as the risk score improves. Indicating lower estimated loss and lower borrower rates for loans with better risk scores.

To get a clear picture we will need to use color palettes and differentiate the loans.

The plot above better shows the effect demonstrated before. The concentration of loans with a lower borrower rate and estimated loss are those with the best risk score. The color palette also shows a change in the score for higher borrower rates.

The same effect demonstrated by the risk score can be seen on the graph above. Showing the effect of the prosper rating to the borrower rates and estimated loss.

Effect of ProsperRatingNumeric on the relationship between BorrowerRate and BorrowerAPR

Earlier we saw the strong positive correlation between the BorrowerAPR and the Borrower Rate, here we seek to investigate the effect of the prosperrating numeric on that relationship. As observed we can see the heat map indicate a concentration change on the loans as the rating score changes. The concentration of the loans lower to the lower borrower rates and borrower APR as the rating improves.

We can also observe on the graph above the same effect demonstrated by prosperratingnumeric, as the income range improves the loans concentrate more on the lower Borrowerrates and BorrowerAPR scales.

On the plot above, we can observe the strengthening factor of the income range and the prosper rating numeric on the Interest rates. We can see the concentration of the loans with the lowest credit rating and income range is on the highest side of the borrower rates and borrower APR scales. The concentration also changes to the lowest side of the BorrowerRates and BorrowerAPR scales as the credit rating and the income range improve. This shows that higher credit rates and income ranges will result into lower Interest rates.

On this plot we show the change of the relationship between the BorrowerRate and Loan Amount depending on the employment status of the borrower. We see a contrast in the gradient of the slopes for employment statuses; employed, other, full-time, retired, part-time and self-employed to the statuses not-available and not-employed . We can see that borrowers with the status not-employed and not available started at higher borrower rates and showed weaker negative correlation as compared to the others.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

On further exploration, we can observe the effect of a ProsperScore on the relation between the EstimatedLoss and the BorrowerRate. The concentration of loans with a lower BorrowerRate and low EstimatedLoss can be observed increasing for higher ProsperScores.

The variables ProsperRatingNumeric and ProsperScore seem to strengthen each others influence on the BorrowerRate assigned. Where a higher ProsperScore and ProsperRatingNumeric resulted into a lower BorrowerRate.

More so we can observe that for higher IncomeRanges and CreditGrade, more loans had a lower BorrowerRate as compared to those with lower IncomeRanges and CreditGrade. An additional correlation with the BorrowerAPR affirms this observation.

Combining the LoanOriginalAmount and the EmploymentStatus showed a steeper gradient for employed Borrowers as compared to Unemployed. With the correlation being negative. This means a higher BorrowerRate can be expected for unemployed individuals as compared to employed.

Due to the value count of the borrowers employment status being high for employed, full time and self employed as compared to the others combination of such as the EmploymentStatus of an individual combined with the EstimatedLoss and BorrowerRate of the loans showed little result in the graph.

Were there any interesting or surprising interactions between features?

Additionally, with the BorrowerRate and LoanOriginalAmount variables combined with the ProsperScore, the gradient of the lines change gradually from a negative correlation to a weak positive correlation at score 10 and 11. Showing a higher loan amount can result in a slight increase in the BorrowerRate, if the ProsperScore is high enough.

Conclusions

The prosper loans dataset contains over 100k observations with 81 variables spanning across 9 years.Understanding the variables, terminology and general domain knowledge of financial peer-to-peer lending was the first obstacle in approaching this dataset. After resolving this, I assessed and cleaned the data for any issues.

My main investigation was geared towards predicting loan BorrowerRate (Interest Rate) based on its relation to the other vairables.

The main variables of interest I identified were:

For each of these variables I first performed Univariate Exploration of them to identify their characteristics. For instance, I realized the CreditGrade, ProsperRatingNumeric and ProsperRatingAlpha are based on the same concept. Which according to the dictionary are grading variables. I then proceeded to visualize their nature and check on their value_counts.

Then I matched up the data against the BorrowerRates and each other in Bivariate Exploration. This led to significant discoveries such as the necessity to use ProsperRatingNumeric instead of CreditGrade while comparing it to the Estimated Loss. Since for each entry on the EstimatedLoss field a null entry was observed for CreditGrade.

Main discoveries were made on this stage which include,

However, the one ongoing hurdle was determining which variables to analyze, not drifting too far off any one path of investigation and not pulling in new variables throughout the process. Another persistent issue was overplotting on scatterplots, a number of techniques were used across multiple plots to acquire the results.

Additional Variable I considered are:

Success was found in many areas, the general analysis revealed areas of interests such as negative correlation between ProsperScore and BorrowerRate which showed a trend of decrease of the BorrowerRate dependent on the increase in the ProsperScore. Also, trends were confirmed and unexpected, unknown relationships such as those between loan amount and the number of investors contributing to that loan, were revealed.

I also categorize success as to areas that were discovered which need to be further investigated. I believe additional time in multivariate analysis on variables such as occupation, income range, loan category and BorrowerRate would expose more trends and perhaps allow for more predictions. Also, having only analysed 18 of the original 81 variables leaves a lot of undiscovered relationships/correlations and trends.

Additional data would also enhance this dataset. Having the borrower’s age and sex would allow analysis to possibly discover trends among men and women or young and old. Also, population and state-average-income features, would allow for discovery of the type of environment the borrower lived in. For example, a borrower at $75,000 income living in California could be considered middle class within that State. This is in comparison to a borrower from Texas in the same income range bracket which might be consider in the upper class - Does your class (lower, middle or upper) help determine your BorrowerRate? This would be dependent on the borrower’s state and the state’s average income range. These types of questions and more could be answered with additional data.